fix(geo): tokenization-based keyword matching to prevent false positives#503
Merged
fix(geo): tokenization-based keyword matching to prevent false positives#503
Conversation
Replace String.includes() with tokenization-based Set.has() matching
across the geo-tagging pipeline. Prevents false positives like "assad"
matching inside "ambassador" and "hts" matching inside "rights".
- Add src/utils/keyword-match.ts as single source of truth
- Decompose possessives/hyphens ("Assad's" → includes "assad")
- Support multi-word phrase matching ("white house" as contiguous)
- Remove false-positive-prone DC keywords ('house', 'us ')
- Update 9 consumer files across geo-hub, map, CII, and asset systems
- Add 44 tests covering false positives, true positives, edge cases
Co-authored-by: karim <mirakijka@gmail.com>
Fixes #324
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Address code review feedback: P1a: Add suffix-aware matching for plurals and demonyms so existing keyword lists don't regress (houthi→houthis, ukraine→ukrainian, iran→iranian, israel→israeli, russia→russian, taiwan→taiwanese). Uses curated suffix list + e-dropping rule to avoid false positives. P1b: Expand conflictTopics arrays in DeckGLMap and Map with demonym forms so "Iranian senate..." correctly registers as conflict topic. P2: Replace inline test functions with real module import via tsx. Tests now exercise the production keyword-match.ts directly.
The .mts test file wasn't covered by `node --test tests/*.test.mjs`. Add `npx tsx --test tests/*.test.mts` so test:data runs both suites.
- Use tsx as test runner for both .mjs and .mts (single invocation) - Removes ; separator which breaks on Windows cmd.exe - Add tsx to devDependencies so it works in offline/CI environments
- Add wordMatches() for suffix-aware phrase matching so "South Korean" matches keyword "south korea" and "North Korean" matches "north korea" - Add MIN_SUFFIX_KEYWORD_LEN=4 guard so short keywords like "ai", "us", "hts" only do exact-match (prevents "ais"→"ai", "uses"→"us" false positives) - Add 5 new tests covering both fixes (58 total, all passing)
Add compound suffixes (ians, eans, ans, ns, is) to handle plural demonym forms like "Iranians"→"iran", "Ukrainians"→"ukraine", "Russians"→"russia", "Israelis"→"israel". Adds 5 new tests (63 total).
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #324 — supersedes #330 by @princelevant (preserving authorship via
--author)String.includes()keyword matching caused false positives:"assad"matched inside"ambassador","hts"inside"rights","house"inside"housing", etc.Set.has()for O(1) lookups) with sub-part decomposition for possessives/hyphensKey design decisions
src/utils/keyword-match.tswithtokenizeForMatch(),matchKeyword(),matchesAnyKeyword(),findMatchingKeywords()"Assad's"→ Set contains both"assad's"and"assad"— fixes the possessive false-negative from fix: use word-boundary regex for geo-tagging keyword matching #330"white house"matches as contiguous ordered tokens, so"the house is painted white"does NOT match'house'(matched "housing"/"warehouse") and'us '(trailing-space hack)'hts'keyword kept — safe with tokenization sincetokens.has('hts')won't match "rights"Intentionally NOT changed
analysis-constants.ts(includesKeyword/containsTopicKeyword) — different pipeline, blast radius concernentity-index.ts— uses regex for match position extractioncountry-geometry.ts— already uses\bregex correctlyTest plan
npx tsc --noEmitpassesnode --test tests/geo-keyword-matching.test.mjs— 44 tests pass"French Ambassador outlines new strategy"does NOT match Damascus"Assad's forces advance in Syria"DOES match Damascus"Human rights groups condemn"does NOT match Damascus via 'hts'"White House announces budget"matches DC,"Housing market crashes"does NOTcc @princelevant — thank you for the original investigation and tokenization approach in #330!